
File Comparator

1. Description

This binary compares two files (primary and secondary) and reports differences: missing rows, mismatches, and invalid/duplicate keys. It supports native .cf files, plain text files, and Parquet inputs. The tool can compare files with the same structure or perform mapped comparisons when the two sources have different schemas.

2. Supported input types and combinations

  • .cf : dynamic (.cf) and static (.cf + .peo) variants.
  • .txt (delimited text).
  • .pq (Parquet files or directories containing Parquet data).

Allowed comparison combinations:

  • cf vs cf (dynamic or static) — supported.
  • cf vs txt — supported for dynamic .cf and txt.
  • txt vs txt — supported without metadata.
  • parquet vs parquet — supported.
  • parquet vs cf / parquet vs txt — supported.

Notes:

  • For comparisons involving .txt, all fields are treated as strings.
  • When using V5 metadata resolution, the tool can load .cf metadata from a directory of JSON config files (see metadata options below).

3. How to run

Required parameters:

  • --primary-file : Path to the primary input file.
  • --secondary-file : Path to the secondary input file.
  • --output-directory : Directory where comparison outputs will be written.

Key/value mapping parameters (one of these forms is required):

  • Same-structure comparisons:

    • --keys : Comma-separated list of key columns present in both sources.
    • --values (optional) : Comma-separated list of value columns to compare. If omitted, all columns are compared.
  • Mapped comparisons (different schemas):

    • --primary-keys and --secondary-keys : Comma-separated key columns for primary and secondary respectively (must have same length and order).
    • --primary-values and --secondary-values : Comma-separated value columns for primary and secondary (must have same length and order).
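The two parameter forms above can be reduced to one list of (primary column, secondary column) pairs. The following is a minimal sketch of that idea, not the tool's actual implementation: the function names are hypothetical, and letting the shared form take precedence when both forms are given is an assumption.

```python
from typing import List, Optional, Tuple


def split_columns(raw: Optional[str]) -> List[str]:
    """Split a comma-separated option value into trimmed column names."""
    if not raw:
        return []
    return [part.strip() for part in raw.split(",") if part.strip()]


def resolve_pairs(shared: Optional[str],
                  primary: Optional[str],
                  secondary: Optional[str],
                  label: str) -> List[Tuple[str, str]]:
    """Return (primary_column, secondary_column) pairs for keys or values."""
    if shared:
        # Same-structure form: each column applies to both inputs.
        # Assumption: the shared form wins if both forms are supplied.
        return [(col, col) for col in split_columns(shared)]
    left, right = split_columns(primary), split_columns(secondary)
    if len(left) != len(right):
        raise ValueError(
            f"--primary-{label} and --secondary-{label} must have the same length"
        )
    # Mapped form: columns are paired by position, so order matters.
    return list(zip(left, right))


def resolve_keys(keys: Optional[str] = None,
                 primary_keys: Optional[str] = None,
                 secondary_keys: Optional[str] = None) -> List[Tuple[str, str]]:
    """Keys are mandatory, so an empty result is an error."""
    pairs = resolve_pairs(keys, primary_keys, secondary_keys, "keys")
    if not pairs:
        raise ValueError("provide --keys, or both --primary-keys and --secondary-keys")
    return pairs
```

For example, `resolve_keys(primary_keys="id", secondary_keys="acct_id")` pairs `id` with `acct_id`, while an empty value list from `resolve_pairs(None, None, None, "values")` can be read as "compare all columns".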

Other options:

  • --client-mode : Optional client mode label such as V4 or V5. When set to V5, the program can auto-resolve .cf metadata from a config directory.
  • --primary-metadata-path : Explicit path to a metadata JSON for primary .cf input.
  • --secondary-metadata-path : Explicit path to a metadata JSON for secondary .cf input.
  • --metadata-config-directory : Directory of JSON metadata configs used to resolve .cf metadata by input file name (used when --client-mode V5).
  • --tolerance-value : Numeric tolerance used for numeric comparisons (defaults to exact match if not provided).
  • --header-count : Number of header rows to skip in text files (default 0).
  • --delimiter : Delimiter for text files. Accepts a single character, TAB, or \t. Default is |.
  • --rows-to-check : Optional limit on how many rows are read from the primary input (useful for quick checks).
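To illustrate how --delimiter, --header-count and --rows-to-check interact for text inputs, here is a small sketch. It is an illustration under assumptions rather than the tool's own code: the function names are hypothetical, and it assumes that the row limit counts data rows only (header rows excluded).

```python
from typing import Iterable, Iterator, List, Optional


def normalize_delimiter(raw: str = "|") -> str:
    """Map the --delimiter value to the actual separator character."""
    if raw in ("TAB", "\\t"):
        return "\t"
    if len(raw) != 1:
        raise ValueError(f"unsupported delimiter: {raw!r}")
    return raw


def read_rows(lines: Iterable[str],
              delimiter: str = "|",
              header_count: int = 0,
              rows_to_check: Optional[int] = None) -> Iterator[List[str]]:
    """Yield each data row as a list of string fields."""
    sep = normalize_delimiter(delimiter)
    emitted = 0
    for index, line in enumerate(lines):
        if index < header_count:
            continue  # skip the leading header rows
        if rows_to_check is not None and emitted >= rows_to_check:
            break  # assumed: the limit applies to data rows only
        yield line.rstrip("\r\n").split(sep)
        emitted += 1
```

Every field stays a string here, which matches the note that comparisons involving .txt treat all fields as strings.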

Behavior notes:

  • If using --keys / --values (shared columns), these names/positions are applied to both inputs.
  • If the program cannot resolve keys or value mappings, it will fail with a helpful message.
  • Duplicate keys are treated as invalid rows and the run will fail (non-zero exit) if duplicates are present — invalid-key reports are still written to the output directory.
  • For a dynamic .cf file, columns can be referenced by position instead of by metadata field name.
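The invalid-key handling described above can be pictured with the sketch below. It is hypothetical in several respects: the function name, the zero-based key positions, the comma-joined record display, and the choice to keep the first occurrence of a key as valid while reporting later occurrences as duplicates are all assumptions, not documented behavior.

```python
from typing import Dict, List, Sequence, Tuple

Record = Sequence[str]
Key = Tuple[str, ...]


def split_valid_invalid(records: Sequence[Record],
                        key_positions: Sequence[int]
                        ) -> Tuple[Dict[Key, Record], List[str]]:
    """Separate records with usable keys from empty-key and duplicate-key rows."""
    valid: Dict[Key, Record] = {}
    invalid: List[str] = []
    for record in records:
        key = tuple(record[pos] for pos in key_positions)
        display = ",".join(record)  # assumed display format
        if any(part == "" for part in key):
            invalid.append(f"Empty key|{display}")
        elif key in valid:
            # Assumption: the first occurrence stays valid, repeats are reported.
            invalid.append(f"Duplicate key|{display}")
        else:
            valid[key] = record
    return valid, invalid
```

A caller could write the `invalid` lines to the invalid-keys report first and only then signal failure, which is consistent with reports still being written when duplicates cause a non-zero exit.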

4. Metadata handling for .cf inputs

  • You can provide explicit metadata files with --primary-metadata-path or --secondary-metadata-path (JSON describing fields and types).
  • If --client-mode is V5 and --metadata-config-directory is provided, the program will try to resolve metadata automatically:
    • It searches JSON files in the directory for an input field matching the full path or filename of the .cf being compared.
    • If a single match is found, that metadata is used.
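The lookup described above might look roughly like the following sketch. The JSON layout is not documented here, so the top-level key name `input` is an assumption, as are the function name, the `*.json` file pattern, and returning `None` when there is no match or more than one.

```python
import json
from pathlib import Path
from typing import Optional


def resolve_metadata(config_dir: str, cf_path: str) -> Optional[dict]:
    """Return the single config whose 'input' names the given .cf file."""
    # Match on either the full path or just the file name.
    targets = {cf_path, Path(cf_path).name}
    matches = []
    for config_file in sorted(Path(config_dir).glob("*.json")):
        try:
            config = json.loads(config_file.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # unreadable or malformed config: skip it
        # "input" is an assumed field name for this illustration.
        if isinstance(config, dict) and config.get("input") in targets:
            matches.append(config)
    # Only an unambiguous single match is used (assumed fallback: None).
    return matches[0] if len(matches) == 1 else None
```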

5. Output files and meaning

All outputs are written into the directory provided by --output-directory.

  • missing_in_primary.txt — rows present in the secondary file but not in primary. Each line is a display string for the missing record.
  • missing_in_secondary.txt — rows present in the primary file but not in secondary. Each line is a display string for the missing record.
  • invalid_keys_primary.txt — invalid or duplicate primary rows. Each line is of the form REASON|RECORD_DISPLAY (e.g. Empty key|... or Duplicate key|...).
  • invalid_keys_secondary.txt — same as above for secondary.
  • mismatch.txt — list of mismatched rows with a header row. There are two header variants depending on whether a TXT input is involved:
    • When a TXT input is involved, the header is:
      key|mismatch_columns|primary_record|secondary_record
      • key: the configured key values.
      • mismatch_columns: a bracketed list of the column labels that differ, e.g. [colA, colB].
      • primary_record and secondary_record: the entire record from each file, displayed with a comma as the delimiter.
    • When no TXT input is involved (both sources are structured, e.g. .cf), the header is:
      key|mismatch_columns|primary_record|secondary_record|missing_in_primary_cashflow|missing_in_secondary_cashflow
      The two extra fields contain nested cashflow differences for cashflow-type columns (if any). Each is an encoded list like [(interest,principal,date), ...] and is empty if not applicable.
  • summary.txt — summary file includes:
    • Count in primary: total rows read from primary.
    • Count in secondary: total rows read from secondary.
    • Valid rows primary: total number of valid records in the primary file.
    • Valid rows secondary: total number of valid records in the secondary file.
    • Invalid rows primary: total number of invalid records in the primary file.
    • Invalid rows secondary: total number of invalid records in the secondary file.
    • Mismatch count: total number of records that have mismatched values.
    • Missing in primary count: total number of records missing from the primary file.
    • Missing in secondary count: total number of records missing from the secondary file.
    • Distinct mismatch columns: [col1, col2, ...], the list of columns that had mismatches.
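To make the relationship between the output files concrete, the sketch below derives the three result sets and a few summary counts from two keyed inputs. It is a simplified illustration, not the tool's implementation: the function name and the in-memory dictionaries are hypothetical, values are compared as plain strings, and cashflow columns are ignored.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, ...]
Row = Dict[str, str]


def compare(primary: Dict[Key, Row],
            secondary: Dict[Key, Row],
            value_columns: List[str]) -> dict:
    """Classify keys as missing or mismatched and collect summary counts."""
    # Keys found on only one side feed the two missing_in_* files.
    missing_in_primary = [k for k in secondary if k not in primary]
    missing_in_secondary = [k for k in primary if k not in secondary]

    # Keys found on both sides are compared column by column.
    mismatches: List[Tuple[Key, List[str]]] = []
    for key, p_row in primary.items():
        s_row = secondary.get(key)
        if s_row is None:
            continue
        differing = [c for c in value_columns if p_row.get(c) != s_row.get(c)]
        if differing:
            mismatches.append((key, differing))

    return {
        "missing_in_primary": missing_in_primary,
        "missing_in_secondary": missing_in_secondary,
        "mismatches": mismatches,
        "mismatch_count": len(mismatches),
        "distinct_mismatch_columns": sorted({c for _, cols in mismatches for c in cols}),
    }
```

Each entry in `mismatches` carries what a mismatch.txt row needs for its key and mismatch_columns fields; the record display strings would be added when the row is written.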

6. Common usage examples


  • Basic same-structure compare (keys by name):

--primary-file primary.cf
--secondary-file secondary.cf
--keys accountId
--output-directory ./out

  • Text file compare with header and comma delimiter:

--primary-file primary.txt
--secondary-file secondary.txt
--keys 1,2
--delimiter ,
--header-count 1
--output-directory ./out

  • Mapped compare when columns differ:

--primary-file a.cf
--secondary-file b.cf
--primary-keys id
--secondary-keys acct_id
--primary-values balance
--secondary-values bal
--output-directory ./out

  • Use a tolerance for numeric comparison (example: 0.01):

--primary-file a.cf
--secondary-file b.cf
--keys id
--tolerance-value 0.01
--output-directory ./out
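
As a rough picture of what a tolerance of 0.01 means for a single pair of values, consider the sketch below. It assumes an absolute, inclusive tolerance and a fallback to string equality for non-numeric values; the documentation above does not state whether the tool's tolerance is absolute or relative, and the function name is hypothetical. Decimal is used to avoid binary floating-point rounding at the boundary.

```python
from decimal import Decimal, InvalidOperation


def values_match(primary: str, secondary: str, tolerance: str = "0") -> bool:
    """Compare two field values, allowing an absolute numeric tolerance."""
    if primary == secondary:
        return True
    try:
        # Assumption: absolute difference, boundary included.
        return abs(Decimal(primary) - Decimal(secondary)) <= Decimal(tolerance)
    except InvalidOperation:
        # At least one side is not a usable number: fall back to the
        # string comparison above, which already failed.
        return False
```

With this sketch, `values_match("100.00", "100.01", "0.01")` is true and `values_match("100.00", "100.02", "0.01")` is false; with the default tolerance of zero, only numerically equal values match.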